The Icelandic Parsed Historical Corpus (IcePaHC)
Abstract
We describe the background to and the building of IcePaHC, a one-million-word parsed historical corpus of Icelandic that has just been completed. The corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and collection process and discuss the quality of the texts and their conversion to modern Icelandic spelling. We explain why we chose to use a Penn-style phrase-structure annotation scheme and briefly describe the syntactic annotation process. We also describe a spin-off project that is only in its early stages: a parsed historical corpus of Faroese. Finally, we argue for the importance of an open-source policy for language resources.
Related resources
Creating a Dual-Purpose Treebank
We describe the background to and the building of IcePaHC, a one-million-word parsed historical corpus of Icelandic that has just been completed. The corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and collection process and discuss the quality of the texts and their conversion to modern I...
On the use of passives across Germanic
There has long been an intuition that the subject position is deeply relevant to information-structural concerns cross-linguistically, particularly because subjects frequently appear at the left edge of a clause, which is thought to be associated with given/topical information (cf. Vallduvi 1992). This raises interesting questions about the role of the passive construction in the disco...
Tagging the Past: Experiments using the Saga Corpus
There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...
Rapid Deployment of Phrase Structure Parsing for Related Languages: A Case Study of Insular Scandinavian
This paper presents ongoing work that aims to improve machine parsing of Faroese using a combination of Faroese and Icelandic training data. We show that even if we only have a relatively small parsed corpus of one language, namely 53,000 words of Faroese, we can obtain better results by adding information about phrase structure from a closely related language which has a similar syntax. Our ex...
The HeliPaD: a parsed corpus of Old Saxon
This short note introduces the HeliPaD, a new parsed corpus of Old Saxon (Old Low German). It is annotated according to the standards of the Penn Corpora of Historical English, enriched with lemmatization and additional morphological attributes as well as textual and metrical annotation. This note provides an overview of its main features and compares it to existing resources such as the Deutsc...
Publication date: 2012